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ABSTRACT. A general framework is that the estimators of a distribution are obtained 
by minimizing a function (the estimating function) and they are assessed through another 
function (the assessment function). The estimating and assessment functions generally esti- 
mate risks. A classical case is that both functions estimate an information risk (specifically 
cross entropy); in that case Akaike information criterion (AIC) is relevant. In more gen- 
eral cases, the assessment risk can be estimated by leave-one-out crossvalidation. Since 
leave-one-out crossvalidation is computationally very demanding, an approximation for- 
mula can be very useful. A universal approximate crossvalidation criterion (UACV) for 
the leave-one-out crossvalidation is given. This criterion can be adapted to different types 
of estimators, including penalized likelihood and maximum a posteriori estimators, and of 
assessment risk functions, including information risk functions and continuous rank prob- 
ability score (CRPS). This formula reduces to Takeuchi information criterion (TIC) when 
cross entropy is the risk for both estimation and assessment. The asymptotic distribution 
of UACV and of a difference of UACV is given. UACV can be used for comparing esti- 
mators of the distributions of ordered categorical data derived from threshold models and 
models based on continuous approximations. A simulation study and an analysis of real 
psychometric data are presented. 
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1 Introduction 



Akaike information criterion (AIC) (Akaike 1973| l is useful for comparing parametric 
models. AIC assumes parametric models and maximum likelihood estimators; it has been 



developed from the Kullback-Leibler divergence. Commenges et al. ( 2008 1 showed that a 
normalized difference of AIC of two models could be considered as estimating a difference 
of Kullback-Leibler risks of maximum likelihood estimators of the density based on the 
two models. Likelihood crossvalidation (LCV) has also been widely used for comparing 



parametric models. Stone ( 1974 1 showed that LCV was asymptotically identical to AIC. 
LCV however is more flexible in that it can be applied to other estimators than maximum 



likelihood estimators (MLE), for instance to penalized likelihood estimators: see Golub 



et al. ( 1979] ); Wahba ( 1985 1. It has asymptotic optimality properties (Van Der Laan et al. 



|2004[ ). The leave-one-out crossvalidation is the most natural and one of the most efficient 
but it is also the most computationally demanding so that approximation formulas have 
been derived. Stone ( 1974| l was the first to give an approximation formula. Other works 
developed generalized approximation crossvalidation for smoothing parameter selection 
for penalized splines (Xiang and Wahba 1996; G u and Xiang| |200"T| or for penalized like- 
lihood (O'Sullivan, 1986; Commenges et al. 2007| l. Cross-validation can also be applied 
to other assessment risks than Kullback-Leibler risk; for a review see |Arlot and Cehssej 
( |20TQ> . 

We consider the following framework: estimators of the true density function are de- 
fined as minimizing an estimating function; the estimating function itself can be viewed as 
an estimator of a risk, that we call an "estimating risk"; the estimators of the true density 
are assessed using an "assessment risk". The assessment risk can be estimated by crossval- 
idation, allowing a choice between them. Leave-one-out crossvalidation is one of the best 
crossvalidation procedure but is very computationally demanding. The aim of this paper 
is to find a universal approximation for leave-one-out crossvalidation, valid whatever the 
estimating and assessment risks. 

Section 2 presents the framework and the universal approximate criterion (UACV); sec- 
tion 3 shows how it specializes to particular cases. In section 4 the asymptotic distributions 
of UACV and of a difference of UACV are given. Section 5 presents a simulation study, 
while section 6 presents an illustration of the use of UACV for comparing estimators de- 
rived from threshold models and estimators obtained by continuous approximations in the 
case of ordered categorical data with repeated measurements. Section 7 concludes. 
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2 The universal approximate crossvalidation criterion 



2.1 The risk and its estimation by crossvalidation 

Suppose that a sample of independently identically distributed (iid) variables O n = (Yi,i = 
1, . . . , n) is available. Based on O n , an estimator g e (where 8 is short for 9 n ) of the proba- 
bility density function /* of the true distribution can be chosen in a family of distributions 
(g e )g^Q, 8 C Estimators are chosen as minimizing an estimating function §Q n (9) 
where $<5 (0) — rT 1 tnat i s ' ^ = argmin e $g (6). We shall assume that 

<j>(8, y) is a thrice differentiable function of 8 for any y. Under mild conditions $g con- 
verges toward $00(0) = E*{</>(#, Yi)}, where E* means that the expectation is taken under 
the true probability specified by /* . Thus, we can think of these estimating functions as 
estimators of a risk function, which by definition is the expectation of a loss function. For 
instance if the loss is 4>{9, Y) = - log g e {Y), §o n {9) estimates the risk E*{- log g e (Y)}; 
this risk is the cross entropy of g e relative to /* (see section 



3.1 



. Maximum likelihood es- 



timators (MLE) are derived this way (Akaike, 1973; Commenges 2009). A more general 
class of estimators of this form is that of M-estimators. Under some conditions given in 



Van der Vaart 



(2000 1, 9 converges in probability toward 9q = argmin <i> oo (0). Several 
estimating loss functions depend on 9 only through g e but some may depend on 8 directly, 
for instance through penalty terms. 

Let us consider another loss function ip ( g e , Y ) which defines a risk function ( g e ) = 
E*{i/)(<7 61 , Y)}; we shall assume that ip(g , y) is a twice differentiable function in 9 for all 
values of y. This risk function may serve for assessing estimators, but since estimators are 
random, the assessment risk is an expected risk: ER^'(g e ) — F,^{^p(g e , Y)}. Estimators 
with small assessment risks are preferred. The problem is to estimate the assessment risk 
(without knowing the true density /*). This will be easier if the loss itself does not involve 
/*; this is the case for instance of ip(8,Yi) = — log g e {Yi) but not of the integrated squared 
error ISE(g e ) = f{g e {u) — f*(u)} 2 du. The expectation however is with respect to the 
true distribution, so the risk cannot be computed exactly but has to be estimated. If another 
sample 0' n — (Y- , i = 1, . . . , n) iid with respect to O n were available, a natural estimator 
of the risk would be ER (g e ) = tT 1 YTi=\ VKs 9 , Y(). This is an unbiased estimator of 
the risk. This is often used by practitioners who split their original sample in a training and 
a validation sample. However this practice leads to a loss of efficiency since only half of 
the data are used for computing the estimator g e and half of the data also for estimating its 
assessment risk. Crossvalidation estimators make a more efficient use of the information. 
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In particular the leave-one-out crossvalidation criterion is: 

n 

C\{g § )=n- 1 Y,^9 6 - i ,Y i ), 

i=l 

where 9-i — argmin &o nli and 3?© , 4 = ^ry 4>{9, Yj). CV does nearly as well as if 
another sample 0' n were available. Indeed it can immediately be seen that E{CV(g )} = 
ER^" (g** n - 1 ) . We shall see its asymptotic distribution in section]^ For comparing two esti- 
mators the difference of risks is relevant. This can of course be estimated by the difference 
of the estimated risks whose asymptotic distribution will also be given in section|4] 



2.2 The universal approximate crossvalidation criterion 

The leave-one-out crossvalidation criterion may be computationally demanding since it is 
necessary to run the maximization algorithm n times for finding the i = 1, . . . , n . For 
this reason an approximate formula is very useful. This is possible because 9^i is close to 9. 



Indeed, under mild conditions given in Van der Vaart (2000 1, y/n(9 — 9o) has an asymptotic 
normal distribution; since this is also true for 9^i, this implies that — 9 = O p (n^ 1 / 2 ). 
A Taylor expansion of — gj)\§ around 9 yields: 

0-i-9= - H $ 0nl . qq * \§ + R n, 
d 2 <S> « 

where -f/$ e = — de ? \§, and R n is a quadratic form of 6>_, — 9 involving second and 
third derivatives of §o nli taken in 9 so that \\6 n - §\\ < \\9-i - 9\\. Thus \ \9 n - 9\ \ is 
also an OplnT 1 / 2 ). In virtue of the strong law of large numbers <&o nl . and its derivatives 
converge almost surely toward their expectations taken at 9 . By definition of §q (6) we 
have the relation: 

n$d„(«) = (n- l)*e B | 4 W + W,Yi). (1) 
Taking derivatives of the terms of this equation and taking the values at 9 we find = 
(n — 1) — §if^ L \§ + d ^g^ \§ an< l we obtain that — §i^~\e = — d* where 

- i am, 

d « = T w — e- ( 2 ) 

n — 1 69 

Note that on the right hand of the equation we could use a multiplicative factor — instead 
of -^r since the difference between the two is a 0(n~ 2 ). Hence we have: 

6-i — 6 = H iS di + Rn, (3) 

Note that this implies that 9-i — 9 = Op(n. _1 ) because Hq, ^ = O p (l) and both di 
and R n are O p (ri _1 ). But this in turn implies that R n is in fact an O p (nT 2 ). By twice 
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derivating 



we obtain: 77$ . 



i— 1 u 



hHs,^ where 77 



a 2 * 



the last term is an p (n 1 ), we can replace 77$ by 77<j 



Oi l 

since 



9 g e f n | g in ( 3 i and obtain: 



L i -6 = H^J i + O p {n- 2 ). 
Developing now the assessment loss function for 9-i around 9 yields: 

1>{g S -*,Yi) = i>(g § , Y t ) + (Li - 9) T v t + O p (n~ 2 ), 



(4) 



where 
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(5) 



Replacing in this equation — 9 by its approximation in (|4|) we obtain: ip(g > { , Yi) = 
ip{g e , Yi) + djH^ Vi + O p (n~ 2 ). Taking the mean of the left terms of these equations 
yields CV(g e ) so that we have: 



CY(g e ) = y(g e )+Tmce(H^K) + O p (n~ 2 ), (6) 
where ) = n" 1 YT l= i <P(g § , Y) and K = rT 1 YT i= i ^ rf T- We define UACV as: 

UACV(/) - <&(/) + Trace(77 $ ^ K), (7) 

2.3 Extension to incompletely observed data 

More generally the available observation is O n = (Oi,i = 1, . . . , n) where Oi are sigma- 
fields containing observation of events generated by Y^ Classical cases are that of right- 
censored observations common in survival analysis, left-censored observations common in 
biological measurements with detection limits, interval-censored data common in observa- 
tions from cohort studies. Then risks functions of the type <p(9, Yi) and ip(g e ,Yi) cannot 
be computed from observations (they are not 0„-measurable). The theory above does not 
apply and attempts to estimate such functions require that the model is well-specified, a 
very strong assumption in this context. A solution is to consider loss functions of the form 
4>(9, Oi) and ip(g , Oi), to which the above results apply. The most natural choice is the 
log-likelihood. Such an approach is in fact heuristically taken when using AIC with cen- 
sored data. Some theory and simulations have been given in Commenges et al. ( 2007| l, 



Liquet and Commenges (2011 1 and Commenges et al. ( 20 1 2 1 . 
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3 Particular cases of UACV 



and Thomas 



1991 



3.1 Maximum likelihood estimators and information risk: AIC 

Suppose we take: 4>(9,Yi) = ip(g e ,Yi) = — logg e (Yi). Then, the estimating function is 
minus the loglikelihood which estimates the estimating risk, here the cross-entropy ( |Cover| 
ofg 9 with respect to the true density /*: E*{- log g e (Y)} = H(f*) + 
KL(g 6 ; /*), where H(f*) = -E„{log f*(Y)} is the entropy of /* and KL(g 6 ;f*) = 
E*{log g„ < jhQ } the Kullback-Leibler divergence of g° relative to /*. As for the assessment 
risk, this is the expected cross entropy: 

ECE(/) =E4E4-log/(F)|0 n }] =H(f)+EKL(g § ;f), (8) 

where EKL(/;/*) = E*{log£^}. In that case the loss functions for estimating and 
assessment are the same. The estimating and assessment risks are nearly the same; there is 
however a dissymmetry in that the estimating risk is a cross entropy while, because g is 
random, the assessment risk is an expected cross-entropy. 

In that case the leading term of |7]i is minus the maximized (normalized) loglikelihood. 
We have also that z), is the individual score and di — ^rf^i so that UACV is identical to a 
normalized version of Takeuchi information criterion (TIC). If the model is well specified 
K tends in probability toward II, where 2/, is the individual information matrix. The 
Hessian also tends toward II so that the correction term tends toward p, the number 

of parameters. Thus, if the model is not too badly specified, TIC is approximately equal to 
AIC. We have UACV = J^TIC s» j^AIC, and this estimates the expected cross-entropy 
of the estimator. 



3.2 Maximum a posteriori and maximum penalized likelihood estima- 
tors 

Estimators can be obtained by maximizing the maximum a posteriori (MAP) density of 
9: Y^i=x l°E9 {Yi) ~ where J{0) is minus the log of the prior density of 9. This 

corresponds to a loss function cj)(6, Y) = — logg (Y) + n~ 1 J(9). Here the loss function 
depends on n but since the MAP estimators also converge (toward the same limit as the 
MLE), the above results still hold so that UACV can be applied to the MAP estimators. 
The assessment risk can be the expected cross entropy. Thus, in that case the estimating 
risk and the assessment risk are clearly different. 
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In the case where at least one of the parameters is a function, penalized likelihood can be 
applied for obtaining a smooth estimator ( jO'Suihvan] |T986). The function is approximated 
on a spline basis so that the problem reduces to a parametric one with a large number of 
parameters. This leads to cf>(9, Yi) = — \ogg(Yi; 9) + J(9, n), where J(9, n) is a penalty, 
which involves for instance the norm of the second derivative of the function and which may 
depend on n. J(9, n) may depend on n in such a way that the estimator converges. The 



assessment risk can be the expected cross entropy as in section 3.1 In that case the leading 
term in |7]) is minus the (normalized) loglikelihood taken at the value of the penalized 
likelihood estimator, 9. We still have that Vi is the individual score (although taken at the 
value of the penalized likelihood estimator), and di = ^rr(Ui + 9J gg \§)- We have that 
_ff$ a = — d g e 2 n |g + 9 gg2'"^ |g- In the case where 9 3^2'"^ is a definite positive matrix, 

we have -H$ e > g 9 f " ,g and < — ( g e f " ,§) so that the correction term is 

smaller than for MLE. The correction term is often interpreted as the equivalent number of 



parameters ([Commenges et al. 2007 1. 



3.3 Hierarchical likelihood estimators 

Let us consider the following model: conditionally on bi, Yi has a density gyi(,(.; 9, bi), 
where bi are random effects (or parameters). The (Yi, bi) are iid, where Y, t is multivariate 
of dimension n*. The bi are assumed to have density gt(.; r) with zero expectation, where 
r is a vector of parameters. Typically Yi is (at least partially) observed while bi is not. 

Estimators of both 9 and b = (61, . . . , b n ) are defined as maximizing the h-loglikelihood 
which can be written: BL(9,b,r) = ± £Li hl(K 2 ; 9, h, r) with hl(F i; 9, b i: r) = g Y (Y t] 9, h) + 
log g b (h;T). 

|Com menges et al.| ( [20TT) showed (via a profile likelihood argument) that the hierarchi- 
cal likelihood estimators for 9 are M-estimators: PHL„(6») = HL(6>, b{9)) = i ^=1 hl ( F i! ^> ^l 61 ))- 
Thus the estimating loss is hl(Yi; 9, bi(9)). Estimators may be compared using UACV. If 
we use the expected cross entropy for assessing the estimators, it is necessary to compute 
the marginal likelihood in 9 and the values of the individual scores. Of course, this is pre- 
cisely what we attempted to avoid by using hierarchical likelihood; we have however to 
do it only once (rather than repeatedly within an iterative algorithm), so it may be com- 
putationally feasible. For the correcting term we have also to compute the Hessian of the 
hierarchical likelihood and the individual scores. 



8 



3.4 Restricted AIC 

Liquet and Commenges (201 1 ) have proposed a modification of AIC and LCV for assess- 
ing estimators based on information O n while the assessment risk is based on a smaller 
information 0' n+1 C O n +i- More specifically, the estimator is based on the sample 
O n — (Yi,i — 1, . . . , n) but the assessment risk is based on a random variable Z which 
is a coarsened version of Y. For instance Z is a dichotomization of Y: Z = ly>i- For 
this case, the restricted AIC (RAIC) was derived by both direct approximation of the risk 
and by approximation of the LCV. RAIC is clearly a particular case of UACV for the case: 
<t>(9,Yi) = -logg 6 (Y i ) and <K<7 9 (^)) = -log^). 

3.5 Case where the estimators are assessed by CRPS 



Gneiting and Raftery| ( |2007| l studied scoring rules and particularly the continuous rank 



probability score (CRPS). Its inverse can be used as a loss function and is defined as: 

/+oo 
{G(u,9)-1 U > Y } 2 du, 
-oo 

where G(., 9) is the cumulative distribution function (cdf) of a distribution in the model. 
The risk is a Cramer-von Mises type distance : d(G,G*) = J{G(u) — G*(u)} 2 du. 
It may be interesting to assess MLE's using this assessment risk. In UACV, the lead- 
ing term is n~ l Y^7=i CRPS* (G(., 6), Yj); for the correcting term, is the Hessian 
of the loglikelihood and K must be computed with = ^g\§ = 2 f_™{G(u,9) — 

dG(u,8) I ■ 

Iu>y\ 30 ' \§ du. 

3.6 Estimators based on continuous approximation of categorical data 

Assume Y is an ordered categorical variable taking values I = 0,1, ... ,L. Here for sim- 
plicity we consider that Y is univariate. Several models are available for this type of vari- 
ables. Threshold link models assume that Y> — I if a latent variable A; takes values in the 
interval (q, for I = 1, . . . , L, with c^ +1 = +oo and Y, : = if Aj < c\\ 

L+l 

Aj itself can be modeled as a noisy linear form of explanatory variables A,; = j3xi + with 
Si having a normal distribution of mean zero and variance a 2 , and where Xi are explanatory 
variables. The parameters are 9 = (ci, . . . , cl, /?, cr). An estimator of the distribution can 
be obtained by maximum likelihood leading to define g e . For assessing the estimator it 
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is natural to use ECE(g ) (defined in[8l. Note that since Y is discrete the densities are 
defined with respect to a counting measure that is, g e (l) defines the probability that Y = I. 

One may also make a continuous approximation which leads to simpler computations 
and may be more parsimonious, especially if Y is multivariate as in the illustration of 
section [6] In this approach we consider the model Yi = j3xi + Ei. Maximizing the like- 
lihood of this model for observations of Yi leads to a probability measure specified by the 
density h%. This is however a density relative to Lebesgue measure. This probability mea- 
sure gives zero probabilities to {Yi = 1} for all I, and this yields infinite value for ECE 
(meaning strong rejection of this estimator). However from h c a natural estimator of /* 
can be constructed by gathering at I the mass around I: tfi(l) = J^jj^M du, for 
I = 1, 1, and 0(0) = hj(u) du, 0(L) = f£™ /2 hj(u) du. UACV can 

be computed for this estimator for estimating its ECE. The first term of UACV(/i^) can 
be interpreted as the loglikelihood obtained by this estimator with respect to the counting 
measure. For the correcting term we need the Hessian of the loglikelihood of hj and we 
have to compute 0< = 9 ^^q \§. For instance if Yi = I for / = 1 . . . , L — 1 we have 

r l + l/2 dhl , 



. J 1-1/ 2 St( U ) du 

/+1/2 
Jl-l/2 k o( U ) du 

Since the denominator is the probability under hj that Ye (I — 1/2, 1 + 1/2), Vi can 
be interpreted as the conditional expectation (under hj) of the individual score. Thus if 
hj does not vary much on (I — 1/2, 1 + 1/2), Vi is close to — {n — l)di. Using the same 
arguments as in section [XT] we obtain that UACV is close to correcting by the number of 



parameters as in AIC, a criterion that we call AIC^. This is what Pro ust-Lima et al. (201 1 ) 
proposed, and this is likely to be a good approximation if the number of modalities of Y is 
large. 



4 Asymptotic distribution and tracking interval 



Commenges et al. (2008 ) using results of Vuong ( 1989) studied the asymptotic distribution 



of a normalized difference of AIC as an estimator of a difference of Kullback-Leibler risks. 
Here similar arguments are applied to study the asymptotic distribution of UACV and a 
difference of UACV. Since UACV and CV differ by an O p (n~ 2 ), this will also give the 
asymptotic distribution of CV and a difference of CV. By the continuous mapping theorem, 
the asymptotic distribution of UACV(g e ) is the same as that of ^(g e "). Since the latter 
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quantity is a mean, it immediately follows by the central limit theorem that: 

nV^UACVCsV-RV )}-^ JV(0,/£), (10) 

where k 2 = vai :t ip(g 00 , F). We can also write: 

nVa^uACV^J-ER*^)}^ jV(0,K*), (11) 

and can be estimated by the empirical variance of ?/>(g e , YA, i = 1, . . . , n. 

Given two estimators, g e and h 1 , it is interesting to estimate the difference of their 
risks: A^(g 8 , 0) = ER^(/) - ER^/i?). The obvious estimator is: D VACV (g § , 0) = 
UACV(s^) - \JACV(0). We focus on the case where g? ^ h" 10 . We obtain in that case 
using the same arguments as above: 

« 1/2 {^uacv(/",^")-A(/",^)}^> AA(0,^ 2 ), (12) 

where uj 2 — var {i/>(g e °,Y) -i>(h~< ,Y)}, and this can be estimated by the empirical 
variance of {<tp(g 8 , - tp(0, 1*)} . 

From this we can compute the so called tracking interval (A n ,B n ), where A n = 
DuAcvigP", h ln ) - z a/2 n~ 1/2 u} n and B n = D VAC v (g Pn , n ) + z a / 2 rT 1/2 Cj n , where 
z u is the u th quantile of the standard normal variable. 

Note that is in general much lower than . This has been shown by Comm enges| 



etal. 



(2007 1 for the cross entropy risk and comes from the fact that ip(g e ,Yi) and ipQfl ,Yi) 



are often positively correlated. 



5 Simulation: choice of estimators for ordered categorical 
data 

5.1 Design 

We conducted a simulation study to illustrate the use of UACV for comparing estimators 
derived from threshold link models and estimators obtained by a continuous approximation 
in the case of ordered categorical data (see section 3.6). The aim was to assess the perfor- 
mance of UACV as an estimator of ECE, and to compare it to the normalized naive AIC 
criterion (noted AIC) and the normalized AIC criterion computed on the counting measure 
(noted AlCd). Performances of these criteria were studied in the case where the number of 



modalities (L + 1) of the response variable Y is small (section 5.2 1, and where L is large 
(section |53) >. 
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True distributions 

For all the simulations, the data came from a threshold model specified by: 

{Y i = l if Aj G [q;c m ) for I = 1,...,L 
Yi = if Aj < c\ and c^+i = +00 
A, = fa + X X\ + 2 X 2 + s z i = l,...,n 

where Si had a normal distribution of mean zero and variance a 2 and the two explana- 
tory variables X 1 and X 2 came from a standard normal distribution. In order to not disad- 
vantage the linear continuous approximation compared to the threshold model, the param- 
eters ci, . . . , Cl were chosen as the solution of the following equations: 

' P{Yi < a) = P(Y t > c L ) 
< P{Y t < ci) = P(ci < < c 2 ), 

c l+ i = Ci + m with m={c L - c x )/(L - 1) 

V, 

The different models 

For each generated sample, we fitted the threshold model as previously defined, and a 
linear model assuming a linear continuous approximation of the response variable Y, Yi = 
00 + 0\X} + 0%X 2 + £j. Both models were estimated by maximum likelihood. 

For all simulations, N = 1000 samples were generated. The true criterion ECE (avail- 
able by simulation) was computed by Monte-carlo approach. 

5.2 Small number of modalities (L = 4) 

We consider here the case where the number of modalities of Y is relatively small (L + l = 
5). In this simulation, we fixed 0o = 1, 0% = —2.1, 02 = —3.7 and a 2 — 4. In Table 
[T] we present, for different sample sizes n, the results for the different empirical criteria 
AIC, AlCrf and UACV which can be compared to the true criterion ECE. For any sample 
size, the threshold model provided a better ECE than the linear model (positive difference). 
It appeared that UACV had a small bias for all the sample sizes (of order 10~ 3 ). The 
two others criteria AIC and AIC<j were also in favor of a threshold model. However, as 
expected, the naive normalized AIC failed to estimate the true ECE risk due to the wrong 
probability measure (Lebesgue measure instead of a counting measure). We can note that 
the criterion AIC^ estimated ECE relatively well, with a bias around 10~ 2 and 10~ 3 . 
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Table 1: Performance of the criteria for a small number of modalities (L + 1 = 5) and 
different sample sizes. Mean over 1000 replications of the difference of criteria UACV, 
AIQ, AIC. ECE is the true risk. Biais of the criteria as estimator of ECE 





ECE 


UACV 


AIQ 


AIC 


Bias UACV 


Bias AIQ 


Bias AIC 


n=300 
















Linear 


1.2650 


1.2661 


1.2702 


1.5998 


0.0011 


0.0053 


0.3348 


Threshold 


0.9874 


0.9835 


0.9836 


0.9836 


-0.0039 


-0.0038 


-0.0038 


Difference 


0.2775 


0.2826 


0.2866 


0.6162 


0.0051 


0.0091 


0.3387 


n=500 
















Linear 


1.2594 


1.2635 


1.2660 


1.6468 


0.0041 


0.0066 


0.3875 


Threshold 


0.9753 


0.9810 


0.9810 


0.9810 


0.0057 


0.0057 


0.0057 


Difference 


0.2840 


0.2825 


0.2849 


0.6658 


-0.0016 


0.0009 


0.3818 


n=3000 
















Linear 


1.2612 


1.2617 


1.2621 


1.6325 


0.0005 


0.0009 


0.3713 


Threshold 


0.9763 


0.9749 


0.9749 


0.9749 


-0.0014 


-0.0014 


-0.0014 


Difference 


0.2850 


0.2868 


0.2873 


0.6577 


0.0019 


0.0023 


0.3727 



5.3 Large number of modalities 

We consider here the case where the number of modalities of Y is relatively large (L + 1 = 
20). In this simulation, we fixed /3 = 1, /? x = -0.3, /3 2 = -1-7 and a 2 = 4. The 
results of this simulation are presented in Table [2] For any sample size, the linear model 
provided a better ECE than the threshold model (negative difference). It appeared that 
UACV had a small bias for all the sample sizes (of order 10~ 3 and 10~ 4 ). The AIQ 
criterion gave similar results as the UACV criterion while the AIC criterion failed to find 
the best estimator (positive difference). 

6 Illustration on the choice of estimators for psychometric 
tests 

In epidemiological studies, cognition is measured by psychometric tests which usually con- 
sist in the sum of items measuring one or several cognitive domains. A common example is 
the Mini Mental State Examination (MMSE) score (Folstei net al.|[T975] l, computed as the 
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Table 2: Performance of the criteria for a large number of modalities (L + 1 = 20) and 
different sample sizes. Mean over 1000 replications of the difference of criteria UACV, 
AlCd, AIC. ECE is the true risk. Biais of the criteria as estimator of ECE 



ECE UACV AIQ AIC Bias UACV Bias AIQ Bias AIC 

n=300 

Linear 2.6794 2.6792 2.6796 2.8076 -0.0002 0.0002 0.1282 

Threshold 2.7104 2.7070 2.7069 2.7069 -0.0035 -0.0036 -0.0036 

Difference -0.0311 -0.0278 -0.0273 0.1007 0.0033 0.0038 0.1317 
n=500 

Linear 2.6765 2.6740 2.6742 2.7117 -0.0026 -0.0024 0.0352 

Threshold 2.6933 2.6899 2.6899 2.6899 -0.0033 -0.0034 -0.0034 

Difference -0.0167 -0.0160 -0.0157 0.0218 0.0008 0.0010 0.0385 
n=3000 

Linear 2.6736 2.6716 2.6717 2.7107 -0.0020 -0.0019 0.0371 

Threshold 2.6746 2.6725 2.6725 2.6725 -0.0021 -0.0021 -0.0021 

Difference -0.0010 -0.0009 -0.0008 0.0382 0.0001 0.0001 0.0391 



sum of 30 binary items (grouped in 20 independent items) evaluating memory, calculation, 
orientation in space and time, language, and word recognition; for this reason it is called a 
"sumscore" and ranges from to 30. Although in essence psychometric tests are ordered 
categorical data, they are most often analyzed as continuous data. Indeed, they usually have 
a large number of different levels and, especially in longitudinal studies, models for cate- 



gorical data are numerically complex. Recently, Proust-Lima et al. (2011 1 defined a latent 
process mixed model to analyse repeated measures of discrete outcomes involving either a 
threshold model or an approximation of it using continuous parameterized increasing func- 
tions. Comparison of models assuming either categorical data (using the threshold model) 
or continuous data (using continuous functions) was done with an AIC^, computed with 
respect to the counting measure. In this illustration, we use UACV to compare such latent 
process mixed models assuming either continuous or ordered categorical data when applied 
on the repeated measures of the MMSE and its calculation subscore in a large sample from 
a French prospective cohort study. 
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6.1 Latent process mixed models 

In brief, the latent process mixed model assumes that a latent process (A,(f))t>o underlies 
the repeated measures of the observed variable ii(tjj) for subject i (i — 1, ...,n) and 
occasion j (j = 1, ...,rii). First, the latent process trajectory is defined by a standard 



linear mixed model ( jLaird and Ware| |l982| l: A 4 (i) = Xi(t) T ' f3 + Zi(t) T bi + ^(iy) for 
t > where -X"j(i) and Zj(t) are distinct vectors of time-dependent covariates associated 
respectively with the vector of fixed effects j3 and the vector of random effects bi (hi ~ 
MVN(n, D)), and ej(t) ~ A/"(0, ct 2 ). We further assume that 6jo. the first component of 
bi that usually represents the random intercept, is N(0, 1) for identifi ability; except for the 
variance of b i0 , D is an unstructured variance matrix. 

Then, a measurement model links the latent process with the observed repeated mea- 
sures. For ordered categorical data, a standard threshold model as defined in |9]) (section 



3.6 1 for the univariate case is well adapted. For continuous data, the link has been modeled 



as H(Yi(tij); rj) = Ai(tij) where H(.; rj) is a monotonic increasing transformation. Three 

families of such transformations are considered: (i) H(y]rj) — — ^ — where 

h(.; r?!, 772) is the Beta cdf with parameters (171,772); (ii) H(y; 77) = 171 + YXX ViBf (y) 

where (-B/)i=2,m+2 is a basis of quadratic I-splines with m nodes; (iii) H(y; 77) = 

which gives the standard linear mixed model. 

Latent process mixed models are estimated within the Maximum Likelihood frame- 
work using lcmm function of lcmm R package. When assuming continuous data, the 
log-likelihood can be computed analytically using the Jacobian of H . In contrast, when as- 
suming ordered categorical data, an integration over the random-effects distribution inter- 
venes in the likelihood computation which is approximated by a Gauss-Hermite quadrature. 
Further details can be found in |Proust et al.| ( |2006| and |Proust-Lima et al.| l |2012") . 

UACV is computed from the loglikelihood '5 obtained for the maximum likelihood 
estimators with respect to the counting measure : 

n /. Tif 



= e / n n ( p (^ = i\k)f Y ^ 1 fmdhi 

2=1 Jb * 7=1 1=0 
n „ m L 

= E / II II ( p ( c; ^ A ^ + e *(^) < c l +i\bi)) lY * i= ' fb(bi)dbi 

i=l j = l 1=0 

where A;(ty) + e t {t tJ ) ~ N(X t (t) T p + Z t (t) T /x, Zi(t) T DZi(t) + a 2 ), c Q = -00, 
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Figure 1: Distributions of MMSE Sumscore and MMSE Calculation Subscore in the 
PAQUID sample (n=2,914). Data were pooled from all available visits for a total of 10,846 
observations. 

C£,_|_i = +00, and either q (I = l,...,L) are the estimated thresholds when a threshold 
model is considered, or c; = H(l——,fj) (I = 1, L) when monotonic increasing families 
of transformations are used. 



6.2 Application: categorical psychometric tests 

Data come from the French prospective cohort study PAQUID initiated in 1988 to study 



normal and pathological aging (Letenneur et al. ( 1994 1). Subjects included in the cohort 
were 65 and older at initial visit and were followed up to 10 times with a visit at 1, 3, 5, 8, 
10, 13, 15, 17 and 20 years after the initial visit. At each visit, a battery of psychometric 
tests including the MMSE was completed. In the present analysis, all the subjects free of 
dementia at the 1-year visit and who had at least one MMSE measure during the whole 
follow-up were included. Data from baseline were removed to avoid modeling the first- 
passing effect. A total of 2914 subjects were included with a median of 2 (Interquartile 
range=3-5) repeated measures of MMSE sumscore. The distributions of MMSE sumscore 
ranging from to 30, and of its calculation subscore, ranging from to 5, are displayed in 
Figure [2] 

The trajectory of the latent process was modeled as an individual quadratic function of 
age with correlated random intercept, slope and quadratic slope (Zi(t) — (1, age^ agef )), 
and an adjustment for binary covariates educational level (EL=1 if the subject graduated 
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from primary school) and gender (SEX= 1 if the subject is a man) plus their interactions with 
age and quadratic age (so that Xi(t) = Zi(t) ® (l,ELj,SEXj)). For MMSE sumscore, 
in addition to the threshold model, the linear, Beta cdf and I-splines (with 5 equidistant 
nodes) continuous link functions were considered. For calculation subscore, in addition to 
the threshold model, only the linear link was considered. 



6.3 Results 

Table [3]gives the assessment citeria for estimators based on the different models, and table 
|4]provides the differences in UACV or AIC and their 95% tracking interval. For the MMSE 
sumscore, the mixed model assuming the standard linear transformation yielded a clearly 
worse UACV than other models accounting for nonlinear relationships with the underlying 
latent process. The model involving a Beta cdf gave a similar risk as the one involving the 
less parsimonious I-splines transformation (Duacv (Beta cdf , I — splines) = —0.0070, 
95% Tracking interval: [—0.0152, 0.0012]). Finally, the mixed model considering a thresh- 
old link model, which is numerically demanding, gave the best fit but remained relatively 
close to the simpler ones assuming a Beta cdf (£>uAcv(Beta cdf, Thresholds) = 0.0200, 
95% Tracking interval: [0.0097, 0.0303]) or a I-splines transformation (Duacv (I — splines, 
0.0270, 95% Tracking interval: [0.0166, 0.0374]). For the interpretation of these values 



Commenges et al. (2008) suggested to qualify values of order 10 1 , 10 2 and 10 3 as 



"large", "moderate" and "small" respectively; moreover for multivariate observations, it 
was suggested to divide by the total number of observations rather by the number of in- 
dependent observations. With this correction (which amounts to divide the current values 
by a factor of 3.7 = 10846/2914) the differences between the linear model and the other 
models can be qualified as "large", and the differences between the threshold model and 
both beta cdf and I-splines are between "moderate" and "small". Figure 2 displays the 
estimated link functions in (A) and the predicted mean trajectories of the latent process 
according to educational level in (B) from the models involving either a linear, a beta cdf, a 
I-splines or a threshold link function. The estimated link functions as well as the predicted 
trajectories of the latent process are very close when assuming either beta cdf, I-splines or a 
threshold link function but they greatly differ when assuming a linear link. This difference 
also shows up in the effects of covariates with associations distorted when not accounting 
for nonlinear transformations (as demonstrated in |Proust-Lima et aT7| ( |201 1) >). For example 
in this application, a significant overall increase with age of the difference between educa- 
tional levels is found when assuming a linear model (p=0.01 1 for interaction with age and 
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(A) Estimated link function 



(B) Predicted trajectory of the latent process 




Figure 2: (A) estimated inverse link functions between MMSE sumscores and the underly- 
ing latent process, and (B) predicted trajectories of the latent process of a woman according 
to educational level (with EL+ and EL- for respectively validated or non-validated primary 
school diploma) in latent process mixed models assuming either linear, Beta cdf, I-splines 
or threshold link functions (PAQUID sample, n=2,914); the trajectories for the latter three 
transformations are indistinguishable 

age squared) while it is a significant overall decrease with age which is found in the other 
models (p-value < 0.001 for interaction with age and age squared). 

For the Calculation subscore also, the standard linear mixed model again gave a clearly 
higher risk than the mixed model assuming a threshold link model (£>uacv (linear, Thresholds) = 
0.4523, 95% Tracking interval: [0.4127, 0.4919]). 

7 Conclusion 

We have proposed a universal approximate formula for leave-one out crossvalidation: it 
is universal in the sense that it applies to any couple of estimating and assessment risks 
which can be correctly estimated from the observations. This is in principle restricted to 
parametric models but extends to smooth semi- or non-parametric ones through penalized 
likelihood. The approximate formula not only allows fast computation, but also allows 
deriving the asymptotic distribution. Estimating this distribution is important since the 
variability of UACV, as that of all the criteria used for estimator choice, is large, even if the 
variability of a difference of UACV between two estimators is smaller. 
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Table 3: Number of parameters (p), individual log-likelihood ($(#)), corresponding naive 
normalized AIC (AIC), log-likelihood computed with respect to the counting measure 
(^(0)), corresponding AIC (AIC^), and UACV for latent process mixed models involving 
different transformations H and applied on either the MMSE sumscore or its calculation 
subscore. 

Transformation H p <$>(§) AIC #(0) AIQ UACV 
MMSE 

Linear 16 -8.7468 8.7523 -8.5231 8.5286 8.5361 

Betacdft 18 -7.7514 7.7576 -7.7803 7.7865 7.7865 

I-splines* 21 -7.8249 7.8321 -7.7857 7.7929 7.7935 

Thresholds 44 -7.7473 7.7624 -7.7473 7.7624 7.7665 

Calculation 

Linear 16 -6.0057 6.0111 -4.8143 4.8215 4.8198 

Thresholds 19 -4.3618 4.3683 -4.3618 4.3683 4.3692 



t cdf for cumulative distribution function 

* Quadratic I-splines with 5 equidistant nodes located at 0, 7.5, 15, 22.5 and 30. 
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Table 4: Difference of AIQ ( DAic d ), difference of UACV and its 95% tracking interval 
between latent process mixed models involving different transformations Hi and H 2 , and 
applied on either the MMSE sumscore or its calculation subscore. 



Transformations Hi 1 iJ 2 




-Duacv 


95% tracking interval 


MMSE 








linear / Beta cdf^ 


0.1 All 


0.7495 


[0.6619 ; 0.8372] 


linear / I-splines * 


0.7357 


0.7425 


[0.6526 ; 0.8325] 


Beta cdft / I-splines * 


-0.0064 


-0.0070 


[-0.0152 ; 0.0012] 


I-splines * 1 thresholds 


0.0306 


0.0270 


[0.0166 ; 0.0374] 


Beta cdff / thresholds 


0.0241 


0.0200 


[0.0097 ; 0.0303] 


linear / thresholds 


0.7662 


0.7696 


[0.6784 ; 0.8607] 


Calculation 








linear / thresholds 


0.4515 


0.4523 


[0.4127 ; 0.4919] 



t cdf for Cumulative Distribution Function 

* Quadratic I-splines with 5 equidistant nodes located at 0, 7.5, 15, 22.5 and 30. 
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In this paper, UACV has been applied to the issue of choice between estimators of 
the distribution of categorical data based on threshold models or on models based on a 
continuous approximation. It has been shown that the naive AIC can be misleading while 
a procedure called AIC^ (which had not been theoretically validated) yields results very 
close to UACV, even if the latter is slightly better. Both quantities can be computed in the 
lcmm R package. 
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